NSF PAR Search | NSF Public Access Repository

SetBERT: the deep learning platform for contextualized embeddings and explainable predictions from high-throughput sequencing

https://doi.org/10.1093/bioinformatics/btaf370

Ludwig, II, David W; Guptil, Christopher; Alexander, Nicholas R; Zhalnina, Kateryna; Wipf, Edi M-L; Khasanova, Albina; Barber, Nicholas A; Swingley, Wesley; Walker, Donald M; Phillips, Joshua L; et al. (June 2025, Bioinformatics)

Abstract

Motivation: High-throughput sequencing (HTS) is a modern sequencing technology used to profile microbiomes by sequencing thousands of short genomic fragments from the microorganisms within a given sample. This technology presents a unique opportunity for artificial intelligence to comprehend the underlying functional relationships of microbial communities. However, due to the unstructured nature of HTS data, nearly all computational models are limited to processing DNA sequences individually. This limitation causes them to miss out on key interactions between microorganisms, significantly hindering our understanding of how these interactions influence the microbial communities as a whole. Furthermore, most computational methods rely on post-processing of samples, which could inadvertently introduce protocol-specific bias.

Results: Addressing these concerns, we present SetBERT, a robust pre-training methodology for creating generalized deep learning models for processing HTS data to produce contextualized embeddings and be fine-tuned for downstream tasks with explainable predictions. By leveraging sequence interactions, we show that SetBERT significantly outperforms other models in taxonomic classification with genus-level classification accuracy of 95%. Furthermore, we demonstrate that SetBERT is able to accurately explain its predictions autonomously by confirming the biological relevance of taxa identified by the model.

Availability and implementation: All source code is available at https://github.com/DLii-Research/setbert. SetBERT may be used through the q2-deepdna QIIME 2 plugin, whose source code is available at https://github.com/DLii-Research/q2-deepdna.
HiFine: integrating Hi-C-based and shotgun-based methods to refine binning of metagenomic contigs

https://doi.org/10.1093/bioinformatics/btac295

Du, Yuxuan; Sun, Fengzhu; Birol, Inanc (ed.) (April 2022, Bioinformatics)

Abstract

Motivation: Metagenomic binning aims to retrieve microbial genomes directly from ecosystems by clustering metagenomic contigs assembled from short reads into draft genomic bins. Traditional shotgun-based binning methods depend on the contigs’ composition and abundance profiles and are impaired by the paucity of samples available to construct reliable co-abundance profiles. When applied to a single sample, shotgun-based binning methods struggle to distinguish closely related species using only composition information. As an alternative binning approach, Hi-C-based binning employs the metagenomic Hi-C technique to measure the proximity contacts between metagenomic fragments. However, spurious inter-species Hi-C contacts inevitably generated by incorrect ligations of DNA fragments between species link the contigs from varying genomes, weakening the purity of final draft genomic bins. Therefore, it is imperative to develop a binning pipeline to overcome the shortcomings of both types of binning methods on a single sample.

Results: We develop HiFine, a novel binning pipeline to refine the binning results of metagenomic contigs by integrating both Hi-C-based and shotgun-based binning tools. HiFine designs a strategy of fragmentation for the original bin sets derived from the Hi-C-based and shotgun-based binning methods, which considerably increases the purity of initial bins, followed by merging fragmented bins and recruiting unbinned contigs. We demonstrate that HiFine significantly improves the existing binning results of both types of binning methods and achieves better performance in constructing species genomes on publicly available datasets. To the best of our knowledge, HiFine is the first pipeline to integrate different types of tools for the binning of metagenomic contigs.

Availability and implementation: HiFine is available at https://github.com/dyxstat/HiFine.

Supplementary information: Supplementary data are available at Bioinformatics online.
The K-mer File Format: a standardized and compact disk representation of sets of k-mers

https://doi.org/10.1093/bioinformatics/btac528

Dufresne, Yoann; Lemane, Teo; Marijon, Pierre; Peterlongo, Pierre; Rahman, Amatur; Kokot, Marek; Medvedev, Paul; Deorowicz, Sebastian; Chikhi, Rayan; Birol, Inanc (ed.) (July 2022, Bioinformatics)

Abstract

Summary: Bioinformatics applications increasingly rely on ad hoc disk storage of k-mer sets, e.g. for de Bruijn graphs or alignment indexes. Here, we introduce the K-mer File Format as a general lossless framework for storing and manipulating k-mer sets, realizing space savings of 3–5× compared to other formats, and bringing interoperability across tools.

Availability and implementation: Format specification, C++/Rust API, tools: https://github.com/Kmer-File-Format/.

Supplementary information: Supplementary data are available at Bioinformatics online.
HolistIC: leveraging Hi–C and whole genome shotgun sequencing for double minute chromosome discovery

https://doi.org/10.1093/bioinformatics/btab816

Hayes, Matthew; Nguyen, Angela; Islam, Rahib; Butler, Caryn; Tran, Ethan; Mullins, Derrick; Hicks, Chindo; Birol, Inanc (ed.) (December 2021, Bioinformatics)

Abstract

Motivation: Double minute (DM) chromosomes are acentric extrachromosomal DNA artifacts that are frequently observed in the cells of numerous cancers. They are highly amplified and contain oncogenes and drug-resistance genes, making their presence a challenge for effective cancer treatment. Algorithmic discovery of DMs can potentially improve bench-derived therapies for cancer treatment. A hindrance to this task is that DMs evolve, yielding circular chromatin that shares segments from progenitor DMs. This creates DMs with overlapping amplicon coordinates. Existing DM discovery algorithms use whole genome shotgun sequencing (WGS) in isolation, which can potentially incorrectly classify DMs that share overlapping coordinates.

Results: In this study, we describe an algorithm called ‘HolistIC’ that can predict DMs in tumor genomes by integrating WGS and Hi–C sequencing data. The consolidation of these sources of information resolves ambiguity in DM amplicon prediction that exists in DM prediction with WGS data used in isolation. We implemented and tested our algorithm on the tandem Hi–C and WGS datasets of three cancer datasets and a simulated dataset. Results on the cancer datasets demonstrated HolistIC’s ability to predict DMs from Hi–C and WGS data in tandem. The results on the simulated data showed that HolistIC can accurately distinguish DMs that have overlapping amplicon coordinates, an advance over methods that predict extrachromosomal amplification using WGS data in isolation.

Availability and implementation: Our software, named ‘HolistIC’, is available at http://www.github.com/mhayes20/HolistIC.

Supplementary information: Supplementary data are available at Bioinformatics online.
DeepPASTA: deep neural network based polyadenylation site analysis

https://doi.org/10.1093/bioinformatics/btz283

Arefeen, Ashraful; Xiao, Xinshu; Jiang, Tao; Birol, Inanc (ed.) (April 2019, Bioinformatics)

Abstract

Motivation: Alternative polyadenylation (polyA) sites near the 3′ end of a pre-mRNA create multiple mRNA transcripts with different 3′ untranslated regions (3′ UTRs). The sequence elements of a 3′ UTR are essential for many biological activities such as mRNA stability, sub-cellular localization, protein translation, protein binding and translation efficiency. Moreover, numerous studies in the literature have reported the correlation between diseases and the shortening (or lengthening) of 3′ UTRs. As alternative polyA sites are common in mammalian genes, several machine learning tools have been published for predicting polyA sites from sequence data. These tools either consider limited sequence features or use relatively old algorithms for polyA site prediction. Moreover, none of the previous tools consider RNA secondary structures as a feature to predict polyA sites.

Results: In this paper, we propose a new deep learning model, called DeepPASTA, for predicting polyA sites from both sequence and RNA secondary structure data. The model is then extended to predict tissue-specific polyA sites. Moreover, the tool can predict the most dominant (i.e. frequently used) polyA site of a gene in a specific tissue and relative dominance when two polyA sites of the same gene are given. Our extensive experiments demonstrate that DeepPASTA significantly outperforms the existing tools for polyA site prediction and tissue-specific relative and absolute dominant polyA site prediction.

Availability and implementation: https://github.com/arefeen/DeepPASTA

Supplementary information: Supplementary data are available at Bioinformatics online.